CORD-19 dataset

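The CORD-19 collection contains more than 50,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses, over 40,000 of them with full text. The dataset is freely available to help the research community fight the COVID-19 pandemic.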


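This walkthrough demonstrates how to access the CORD-19 dataset on Azure by connecting to the Azure blob storage account that hosts it, and then walks through the structure of the dataset, in which articles are stored as JSON files. The examples show:

- how to find articles (navigating the container)
- how to read articles (navigating the JSON schema)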


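The examples depend on the following packages:

- Azure Storage (e.g. pip install azure-storage)
- NLTK (see the NLTK documentation)
- Pandas (e.g. pip install pandas)

Getting the CORD-19 data from Azure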

The CORD-19 data has been uploaded as an Azure Open Dataset. We will create a blob service linked to this CORD-19 open dataset.

    # BlockBlobService is part of the legacy SDK (azure-storage-blob 2.x and earlier)
    from azure.storage.blob import BlockBlobService

    # storage account details
    azure_storage_account_name = "azureopendatastorage"
    azure_storage_sas_token = "sv=2019-02-02&ss=bfqt&srt=sco&sp=rlcup&se=2025-04-14T00:21:16Z&st=2020-04-13T16:21:16Z&spr=https&sig=JgwLYbdGruHxRYTpr5dxfJqobKbhGap8WUtKFadcivQ%3D"

    # create a blob service
    blob_service = BlockBlobService(
        account_name=azure_storage_account_name,
        sas_token=azure_storage_sas_token,
    )
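A SAS token carries its own validity window in the `st` (start) and `se` (expiry) query fields, so a rejected request may simply mean the token has lapsed. The following is a minimal sketch that reads those fields with the standard library only; the helper name `sas_validity_window` is hypothetical and not part of any Azure SDK.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# SAS token copied from the storage account details above
sas_token = (
    "sv=2019-02-02&ss=bfqt&srt=sco&sp=rlcup"
    "&se=2025-04-14T00:21:16Z&st=2020-04-13T16:21:16Z"
    "&spr=https&sig=JgwLYbdGruHxRYTpr5dxfJqobKbhGap8WUtKFadcivQ%3D"
)


def sas_validity_window(token):
    """Return (start, expiry) as timezone-aware datetimes parsed from a SAS token."""
    fields = parse_qs(token)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(fields["st"][0], fmt).replace(tzinfo=timezone.utc)
    expiry = datetime.strptime(fields["se"][0], fmt).replace(tzinfo=timezone.utc)
    return start, expiry


start, expiry = sas_validity_window(sas_token)
now = datetime.now(timezone.utc)
print("Token valid from", start.isoformat(), "until", expiry.isoformat())
print("Token currently usable:", start <= now <= expiry)
```

If the second line prints False, the token has expired or is not yet valid, and a current token would be needed before any of the calls in this walkthrough can succeed.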

We can use this blob service as a handle to the data and navigate the dataset with the BlockBlobService API. See the following for details:

- Blob service concepts
- Operations on containers

The CORD-19 data is stored in the covid19temp container. The file structure within the container, with example files, is shown below.

    metadata.csv
    custom_license/
        pdf_json/
            0001418189999fea7f7cbe3e82703d71c85a6fe5.json    # filename is sha-hash
            ...
        pmc_json/
            PMC1065028.xml.json                              # filename is the PMC ID
            ...
    noncomm_use_subset/
        pdf_json/
            0036b28fddf7e93da0970303672934ea2f9944e7.json
            ...
        pmc_json/
            PMC1616946.xml.json
            ...
    comm_use_subset/
        pdf_json/
            000b7d1517ceebb34e1e3e817695b6de03e2fa78.json
            ...
        pmc_json/
            PMC1054884.xml.json
            ...
    biorxiv_medrxiv/                                         # note: there is no pmc_json subdir
        pdf_json/
            0015023cc06b5362d332b3baf348d11567ca2fbb.json
            ...

Each .json file corresponds to one article in the dataset. It stores the title, authors, abstract, and (where available) the full text.
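Because the subset, the parse type, and the article identifier are all encoded in the blob path, a path can be taken apart without opening the file. A minimal sketch, assuming the three-level layout shown above; `parse_blob_name` is a hypothetical helper, not part of the dataset tooling.

```python
# Suffix that follows the article identifier in each parse directory
SUFFIXES = {"pdf_json": ".json", "pmc_json": ".xml.json"}


def parse_blob_name(blob_name):
    """Split '<subset>/<parse_dir>/<file>' into its parts.

    Returns None for blobs that are not article JSON files (e.g. metadata.csv).
    """
    parts = blob_name.split("/")
    if len(parts) != 3:
        return None
    subset, parse_dir, filename = parts
    suffix = SUFFIXES.get(parse_dir)
    if suffix is None or not filename.endswith(suffix):
        return None
    # The identifier is a sha hash for pdf_json and a PMC ID for pmc_json
    return {"subset": subset, "parse": parse_dir, "id": filename[: -len(suffix)]}


print(parse_blob_name("custom_license/pmc_json/PMC1065028.xml.json"))
print(parse_blob_name("metadata.csv"))
```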

Using metadata.csv

The CORD-19 dataset ships with a metadata.csv file that records basic information about every paper in the dataset. It is a good place to start exploring!

    import pandas as pd

    # container housing CORD-19 data
    container_name = "covid19temp"

    # download metadata.csv
    metadata_filename = 'metadata.csv'
    blob_service.get_blob_to_path(
        container_name=container_name,
        blob_name=metadata_filename,
        file_path=metadata_filename
    )

    # read metadata.csv into a dataframe
    metadata = pd.read_csv(metadata_filename)
    metadata.head(3)


    simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']

    def make_clickable(address):
        '''Make the url clickable'''
        return '<a href="{0}">{0}</a>'.format(address)

    def preview(text):
        '''Show only a preview of the text data.'''
        return text[:30] + '...'

    format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}

    metadata[simple_schema].head().style.format(format_)

    # let's take a quick look around
    num_entries = len(metadata)
    print("There are {} many entries in this dataset:".format(num_entries))

    metadata_with_text = metadata[metadata['full_text_file'].notna()]
    with_full_text = len(metadata_with_text)
    print("-- {} have full text entries".format(with_full_text))

    with_doi = metadata['doi'].count()
    print("-- {} have DOIs".format(with_doi))

    with_pmcid = metadata['pmcid'].count()
    print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))

    with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
    print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))

    There are 51078 many entries in this dataset:
    -- 42511 have full text entries
    -- 47741 have DOIs
    -- 41082 have PubMed Central (PMC) ids
    -- 964 have Microsoft Academic paper ids

Example: reading the full text

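metadata.csv does not contain the full text itself. Let's look at an example of how to read it: locate and load the full-text JSON, then convert it into a list of sentences.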

    # choose an example entry with a pdf parse available
    metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
    example_entry = metadata_with_pdf_parse.iloc[42]

    # construct path to blob containing full text
    blob_name = '{0}/pdf_json/{1}.json'.format(example_entry['full_text_file'], example_entry['sha'])
    print("Full text blob for this entry:")
    print(blob_name)

We can now read the JSON content associated with this blob as follows.

    import json

    blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
    data = json.loads(blob_as_json_string.content)

    # in addition to the body text, the metadata is also stored within the individual json files
    print("Keys within data:", ', '.join(data.keys()))

In this example we are interested in body_text, which stores the text data as follows:

    "body_text": [                      # list of paragraphs in full body
        {
            "text": <str>,
            "cite_spans": [             # list of character indices of inline citations
                                        # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                        #      linked to bibliography entry BIBREF3
                {
                    "start": 151,
                    "end": 154,
                    "text": "[7]",
                    "ref_id": "BIBREF3"
                },
                ...
            ],
            "ref_spans": <list of dicts similar to cite_spans>,    # e.g. inline reference to "Table 1"
            "section": "Abstract"
        },
        ...
    ]

The complete JSON schema is published alongside the dataset.
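The `start` and `end` offsets in `cite_spans` index directly into the paragraph's `text`, with `end` exclusive (positions 151-154 cover the three characters of "[7]"). The sketch below uses a synthetic paragraph, with made-up text and offsets, to show how the spans can be read back and how citation markers could be stripped before further processing; `strip_citations` is a hypothetical helper.

```python
# Synthetic paragraph following the body_text layout; text and offsets are made up
sentence = "Coronaviruses are enveloped RNA viruses [7]."
offset = sentence.index("[7]")
paragraph = {
    "text": sentence,
    "cite_spans": [
        {"start": offset, "end": offset + 3, "text": "[7]", "ref_id": "BIBREF3"}
    ],
    "ref_spans": [],
    "section": "Abstract",
}


def strip_citations(paragraph):
    """Remove inline citation markers using the character offsets in cite_spans."""
    text = paragraph["text"]
    # Work right to left so earlier offsets stay valid after each removal
    for span in sorted(paragraph["cite_spans"], key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + text[span["end"]:]
    return text


# Pair each bibliography reference with the marker found at its offsets
cited = [(s["ref_id"], paragraph["text"][s["start"]:s["end"]]) for s in paragraph["cite_spans"]]
print(cited)
print(strip_citations(paragraph))
```

Removing a marker leaves the surrounding whitespace in place, so a whitespace cleanup step may be wanted before sentence tokenization.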

    # sent_tokenize requires NLTK tokenizer data, e.g. nltk.download('punkt')
    from nltk.tokenize import sent_tokenize

    # the text itself lives under 'body_text'
    text = data['body_text']

    # many NLP tasks play nicely with a list of sentences
    sentences = []
    for paragraph in text:
        sentences.extend(sent_tokenize(paragraph['text']))

    print("An example sentence:", sentences[0])

PDF vs. PMC XML parse

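The example above used an entry with has_pdf_parse == True. In that case the blob file path has the form:

    {full_text_file}/pdf_json/{sha}.json

For entries with has_pmc_xml_parse == True, use the following form instead:

    {full_text_file}/pmc_json/{pmcid}.xml.json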
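The two path formats can be wrapped in one small function so that callers do not repeat the format strings. This is only a sketch; `build_blob_name` and its `parse` argument are hypothetical names introduced for illustration.

```python
def build_blob_name(full_text_file, identifier, parse="pdf"):
    """Build the blob path for an article's full-text JSON.

    parse="pdf" expects a sha hash; parse="pmc" expects a PMC ID.
    """
    if parse == "pdf":
        return "{0}/pdf_json/{1}.json".format(full_text_file, identifier)
    if parse == "pmc":
        return "{0}/pmc_json/{1}.xml.json".format(full_text_file, identifier)
    raise ValueError("parse must be 'pdf' or 'pmc', got {!r}".format(parse))


print(build_blob_name("custom_license", "PMC546170", parse="pmc"))
print(build_blob_name("biorxiv_medrxiv", "0015023cc06b5362d332b3baf348d11567ca2fbb"))
```

Raising on an unknown parse type, rather than defaulting silently, keeps a typo from turning into a request for a blob that does not exist.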



    # choose an example entry with a pmc parse available
    metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
    example_entry = metadata_with_pmc_parse.iloc[42]

    # construct path to blob containing full text
    blob_name = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid'])
    print("Full text blob for this entry:")
    print(blob_name)

    blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
    data = json.loads(blob_as_json_string.content)

    # the text itself lives under 'body_text'
    text = data['body_text']

    # many NLP tasks play nicely with a list of sentences
    sentences = []
    for paragraph in text:
        sentences.extend(sent_tokenize(paragraph['text']))

    print("An example sentence:", sentences[0])

    Full text blob for this entry:
    custom_license/pmc_json/PMC546170.xml.json
    An example sentence: Double-stranded small interfering RNA (siRNA) molecules have drawn much attention since it was unambiguously shown that they mediate potent gene knock-down in a variety of mammalian cells (1).

Iterating through blobs directly

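In the examples above, we used the metadata.csv file to navigate the data, construct the blob file path, and read data from the blob. An alternative is to iterate through the blobs themselves.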

    # get and sort list of available blobs
    blobs = blob_service.list_blobs(container_name)
    sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)

We can now iterate through the blobs directly. For example, let's count the number of JSON files available.

    # we can now iterate directly through the blobs
    count = 0
    for blob in sorted_blobs:
        if blob.name.endswith(".json"):
            count += 1
    print("There are {} many json files".format(count))

    There are 59784 many json files

Appendix

Data quality issues

This is a large dataset that, for obvious reasons, was put together in a hurry! Below are some data quality issues we have observed.

Multiple shas

We have observed that in some cases a single entry has multiple shas.
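A single sha hash is 40 hexadecimal characters long, so a longer `sha` value indicates several hashes packed into one field. The sketch below splits such a value into individual hashes. It assumes the hashes are separated by semicolons, which is an assumption about the file rather than something stated here, so check it against real rows first; `split_shas` is a hypothetical helper.

```python
import re

# A sha hash as used in the pdf_json filenames: 40 lowercase hex characters
SHA1_HEX = re.compile(r"^[0-9a-f]{40}$")


def split_shas(sha_field):
    """Split a metadata 'sha' value into individual 40-character hashes.

    Assumes multiple hashes are separated by semicolons (unverified assumption).
    """
    candidates = [part.strip() for part in sha_field.split(";")]
    return [c for c in candidates if SHA1_HEX.match(c)]


# Two hashes from the container listing, joined here only for illustration
example = "0001418189999fea7f7cbe3e82703d71c85a6fe5; 0036b28fddf7e93da0970303672934ea2f9944e7"
print(split_shas(example))
```

Each hash returned this way could then be used on its own to build a pdf_json blob path.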

    metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]

    print("There are {} many entries with multiple shas".format(len(metadata_multiple_shas)))

    metadata_multiple_shas.head(3)

    There are 1999 many entries with multiple shas

Layout of the container


    import re

    container_name = "covid19temp"
    blobs = blob_service.list_blobs(container_name)
    sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)

    # count the JSON files in each <subset>/<parse_dir> directory
    dirs = {}
    pattern = r'([\w]+)/([\w]+)/([\w.]+)\.json'
    for blob in sorted_blobs:
        m = re.match(pattern, blob.name)
        if m:
            dir_ = m[1] + '/' + m[2]
            if dir_ in dirs:
                dirs[dir_] += 1
            else:
                dirs[dir_] = 1

    dirs
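The directory pattern can be checked without contacting storage by running it against a few blob names from the file listing shown earlier. This sketch escapes the dot before `json` and uses `collections.Counter` for the tally, a small variation on the dictionary above.

```python
import re
from collections import Counter

# Same directory pattern, with the dot before "json" escaped
pattern = re.compile(r"(\w+)/(\w+)/([\w.]+)\.json")

# Sample names taken from the container file listing; no storage access needed
sample_names = [
    "metadata.csv",
    "custom_license/pdf_json/0001418189999fea7f7cbe3e82703d71c85a6fe5.json",
    "custom_license/pmc_json/PMC1065028.xml.json",
    "biorxiv_medrxiv/pdf_json/0015023cc06b5362d332b3baf348d11567ca2fbb.json",
]

dirs = Counter()
for name in sample_names:
    m = pattern.match(name)
    if m:
        # Key on "<subset>/<parse_dir>", ignoring the file name itself
        dirs[m.group(1) + "/" + m.group(2)] += 1

print(dict(dirs))
```

Top-level files such as metadata.csv contain no slash, so they do not match and are left out of the tally.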



